Sqoop Incremental Loading

Incremental load is defined as the activity of loading only new or updated records from the database into target table. Incremental loads are useful because they run very efficiently when compared to full loads, particularly so for large data sets. Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.
 The following arguments control incremental imports:
Sqoop supports two types of incremental imports:
  • append
  • lastmodified.
You can use the --incremental argument to specify the type of incremental import to perform. You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column.  Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
 
An alternate table update strategy supported by Sqoop is called lastmodified called mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
At the end of an incremental import, the value which should be specified as  --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred as delta data) from RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop. Incremental load can be performed by using Sqoop import command or by loading the data into hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are-
  • Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode can have value as Append or Last Modified.
  • Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported.
  • Value (last-value) –This denotes the maximum value of the check column from the previous import operation.
Login Cloudera Mysql DataBase

mysql -uroot -pcloudera
create table emp_incr(empid int primary key,ename varchar(50),esal int);
insert into emp_incr values(100,'EMP1',12000); 

insert into emp_incr values(101,'EMP2',13000);
insert into emp_incr values(102,'EMP3',14000);
insert into emp_incr values(103,'EMP4',15000);
insert into emp_incr values(105,'EMP5',16000);

New Records:-

insert into emp_incr values(106,'EMP6',14000);
insert into emp_incr values(107,'EMP7',15000);
insert into emp_incr values(108,'EMP8',16000);

Incremental Append
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp_incr --target-dir=/sqoop/Import_incr -m 1;
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp_incr --incremental append --check-column empid --last-value 105 --target-dir=/sqoop/Import_incr -m 1;

Incremental Last Modified
sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp --target-dir=/sqoop/Import -m 1; 

 sqoop import --connect jdbc:mysql://localhost:3306/dwdev --username root --password cloudera --table emp_incr --incremental lastmodified --check-column esal --last-value 15000 --target-dir=/sqoop/Import_incr -m 1;

No comments:

Post a Comment